ABSTRACT

Four researchers at the Educational Testing Service
describe what they consider some of the most vexing research problems
they face. While these problems are not completely statistical, they
all have major statistical components. Following the introduction
(section 1), in section 2, "Problems with the Simultaneous Estimation
of Many True Scores," Charles Lewis describes a technical problem
that occurs in taking a Bayesian approach to traditional test theory.
In the third section, "Test Theory Reconceived," Robert J. Mislevy
explains problems involved in reconceiving old approaches and new
theories. Section 4, "Allowing Examinee Choice in Exams" by Howard
Wainer, discusses the general problem of nonignorable nonresponse in
the circumstance in which examinees choose to answer only a small
number of test items from a larger sample. The fifth section, "Some
Statistical Issues Facing NAEP," by Eugene G. Johnson, describes the
inferences that are occurring within the National Assessment of
Educational Progress due to nonignorable nonresponse. (Contains 28
references.) (SLD)






RR-92-68 






A Sampling of Statistical Problems 
Encountered at the Educational Testing 

Service 






Howard Wainer 
Eugene G. Johnson
Charles Lewis 
Robert J. Mislevy 
Educational Testing Service 







PROGRAM 
STATISTICS 
RESEARCH 



TECHNICAL REPORT NO. 92-26

Educational Testing Service
Princeton, New Jersey 08541



The Program Statistics Research Technical Report Series is designed to 
make the working papers of the Research Statistics Group at Educational Testing 
Service generally available. The series consists of reports by the members of the 
Research Statistics Group as well as their external and visiting statistical 
consultants. 

Reproduction of any portion of a Program Statistics Research Technical 
Report requires the written consent of the author(s). 






Program Statistics Research 
Technical Report No. 92-26 



Research Report No. 92-68 



Educational Testing Service 
Princeton, New Jersey 08541 



November 1992 



Copyright © 1992 by Educational Testing Service. All rights reserved.






A Sampling of Statistical Problems
Encountered at the Educational Testing
Service 1

Howard Wainer   Eugene G. Johnson   Charles Lewis   Robert J. Mislevy

Educational Testing Service 
Princeton NJ, USA 

I. Introduction 

The Educational Testing Service (ETS) was founded in 1947 by the American 
Council on Education, the Carnegie Foundation for the Advancement of Teaching, and 
the College Board. The primary purpose of ETS is to serve and improve education 
through development and use of high quality measurement procedures and carefully per- 
formed research and related services. 

ETS conducts research on measurement theory and practice, teaching and
learning, and educational policy. Research at ETS has four essential missions: 

1. Basic research, embracing both the technical and the substantive foundations of 
educational measurement, conducted in support of the goals of ETS and its 
clients. 

2. New product research which currently emphasizes innovative uses of technol- 
ogy in support of education and measurement, and the development of assess- 
ment techniques that contribute to more effective teaching and learning. 

3. Research to enhance and maintain the technical quality of tests, including
methodological, psychometric, and statistical studies. 

4. Public service research provides program evaluation for a variety of clients and 
is also involved with policy research. Policy research deals with the implications
of judicial and legislative actions and with issues of access and equity for
women and minorities.

1 This paper has profited from the wisdom and careful reading by our colleague Bill Ward.

In this paper four researchers at ETS describe what they consider some of the 
most vexing problems they face. While these problems are not all completely statistical, 
they all have major statistical components. 

In the first section Charles Lewis describes a technical problem that occurs in
taking a Bayesian approach to traditional test theory. This problem seems to cut to the 
core of the sorts of models he describes. These linear models (so-called 'true score mod- 
els') are the basis of most test scoring schemes and so the problem he describes has 
analogs in many other fields. 

In the second section Robert J. Mislevy explains the growing dissatisfaction with 
models of the sort described previously, and points toward the need for a broader outlook. 
This broader viewpoint has yet to be rigorously characterized; such rigor is sorely needed 
to avoid serious errors of interpretation that seem too often to sneak into current educa- 
tional measurement. 

The third section of this paper, by Howard Wainer, discusses the general problem 
of nonignorable nonresponse in one circumstance, specifically an increasingly popular 
innovation in testing practice — allowing examinees to choose to answer only a small 
number of test items from a larger selection. He points out the pitfalls of such a practice 
and laments its consequences. 

The fourth section, by Eugene G. Johnson, describes the inferences that are oc- 
curring within the ongoing American educational survey called the "National Assessment 
of Educational Progress" (NAEP) due to nonignorable nonresponse. By law, students and 
schools may opt not to participate in the assessment. The problems caused by
nonresponse and the current methods of adjustment are discussed. In response to the
educational reforms described by Mislevy and Wainer, NAEP uses new testing methodologies.
Johnson describes some of these and indicates some statistical issues they engender.

II. Problems with the simultaneous estimation of many true scores

by 

Charles Lewis 

A basic goal of psychometric theory is to make inferences about assumed 
underlying states of knowledge that individuals possess on the basis of their observed 
behavior. In the so-called classical theory of mental tests (see Lord and Novick, 1968, for 
a complete treatment of the subject), observed test scores are described as the sum of a 
true score and an error score: 

$$X = T + E , \tag{1}$$

with, by the definition of $T$,

$$E(X \mid T) = T , \tag{2}$$

so that the variance of the observed scores may be expressed as the sum of the variances
of the components:

$$\sigma_X^2 = \sigma_T^2 + \sigma_E^2 . \tag{3}$$

Moreover, it is common to assume that $T$ and $E$ have independent normal distributions
(with means $\mu$ and 0, respectively). In other words, a one-way, random effects analysis
of variance model is adopted for observed test scores.

In addition to the usual interest in making inferences about the variance
components $\sigma_T^2$ and $\sigma_E^2$, mental test theory focuses attention on individual true scores. In the
model just described, the conditional distribution of $T$, given $X$, is normal with

$$E(T \mid X) = \frac{\sigma_T^2 X + \sigma_E^2 \mu}{\sigma_T^2 + \sigma_E^2} \tag{4}$$

and

$$\sigma^2(T \mid X) = \frac{\sigma_T^2 \sigma_E^2}{\sigma_T^2 + \sigma_E^2} . \tag{5}$$

Equations 4 and 5 assume that the mean and variance components are known. A standard
practice, sometimes referred to as Empirical Bayes estimation (Braun, 1989), has been to
estimate $\mu$, $\sigma_T^2$ and $\sigma_E^2$, and insert these estimates into Equations 4 and 5 as the basis for
inferences about true scores, given observed test scores.
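
As a concrete illustration of Equations 4 and 5, the following minimal Python sketch computes the posterior mean and variance of a true score, taking the mean to be (var_t * x + var_e * mu) / (var_t + var_e) and the variance to be var_t * var_e / (var_t + var_e). The function name `posterior_true_score` and all numerical values are illustrative assumptions, not part of the original report; in Empirical Bayes practice the arguments would be estimates rather than known quantities.

```python
def posterior_true_score(x, mu, var_t, var_e):
    """Posterior mean and variance of a true score T given an observed score x.

    mu    : population mean of the true scores (assumed known here)
    var_t : true score variance
    var_e : error variance
    """
    total = var_t + var_e
    # Weighted average of the observed score and the population mean.
    mean = (var_t * x + var_e * mu) / total
    # The posterior variance does not depend on the observed score.
    variance = var_t * var_e / total
    return mean, variance


# Hypothetical values: population mean 50, true score variance 4, error variance 1.
mean, variance = posterior_true_score(x=55.0, mu=50.0, var_t=4.0, var_e=1.0)
print(mean, variance)
```

With these made-up values the observed score is regressed toward the population mean by the fraction var_e / (var_t + var_e) = 0.2 of its distance from that mean.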

To address this problem more formally, we may adopt a Bayesian framework 
(following, for instance, Novick, Jackson and Thayer, 1971; see Lewis, 1989, for addi- 
tional references and discussion). The first step is to obtain a posterior distribution for a 
set of true scores and the unknown population parameters, given a set of observed scores. 
For purposes of this discussion, let us restrict our attention to the case of one test score 
per individual, and assume that $\sigma_E^2$ is known. With a vague prior density for $\mu$ and $\sigma_T^2$,
and a total of $m$ individuals, the posterior density for all parameters has the form

$$p(\mathbf{T}, \sigma_T^2, \mu \mid \mathbf{X}) \propto (\sigma_T^2)^{-m/2} \exp\left[ -\sum_{i=1}^{m} (X_i - T_i)^2 / (2\sigma_E^2) - \sum_{i=1}^{m} (T_i - \mu)^2 / (2\sigma_T^2) \right] . \tag{6}$$

As $m$ increases, our knowledge about $\mu$ and $\sigma_T^2$ becomes more precise, and the
posterior mean and variance of $T_i$ approach the expressions given in Equations 4 and 5
(Box and Tiao, 1973). To further study the posterior density given in Equation 6, it will
be useful to consider its joint mode. First, it may be shown that the modal values of the
$T_i$ will have the form given in Equation 4, with the modal estimates of $\mu$ and $\sigma_T^2$
substituted for the true values:

$$\tilde{T}_i = \frac{\tilde{\sigma}_T^2 X_i + \sigma_E^2 \tilde{\mu}}{\tilde{\sigma}_T^2 + \sigma_E^2} . \tag{7}$$

From Equation 6, it also follows that the modal estimate of $\mu$ equals the mean of the
modal values of the $T_i$, which in turn equals the mean of the observed test scores:

$$\tilde{\mu} = \bar{X} . \tag{8}$$
Substituting the values from Equations 7 and 8 into Equation 6 and simplifying yields the
following function of $\sigma_T^2$, which may be maximized to obtain the joint mode of the
posterior density:

$$g(\sigma_T^2) = (\sigma_T^2)^{-m/2} \exp\left[ -\frac{1}{2} \, \frac{\sum_{i=1}^{m} (X_i - \bar{X})^2}{\sigma_T^2 + \sigma_E^2} \right] . \tag{9}$$






There is, however, a problem with Equation 9, namely that

$$\lim_{\sigma_T^2 \to 0^+} g(\sigma_T^2) = \infty . \tag{10}$$

In other words, the joint posterior density may be made arbitrarily large with sufficiently
small values of $\sigma_T^2$. The corresponding limiting modal estimates for the $T_i$ may be
derived from Equation 7 as

$$\tilde{T}_i = \bar{X} \tag{11}$$

for all $i$, regardless of the value of the observed score $X_i$. The estimates given in
Equation 11, sometimes referred to as completely regressed or pooled estimates,
obviously have no practical value for the reporting of test results.

It may be useful to take a closer look at the behavior of the joint posterior density
as a function of $\sigma_T^2$. Consider, as an example, a case with 100 observed test scores,
whose sample variance is 5.0, and for whom the error variance is known to be 1.0. Based
on Equation 3, a consistent estimate of true score variance would be 4.0. Figure 1 shows
the form of $\log_{10} g(\sigma_T^2)$ for $\sigma_T^2$ between 0.0 and 10.0. Besides illustrating the result in
Equation 10, namely that the density increases without limit as $\sigma_T^2$ approaches zero,
Figure 1 shows a secondary mode at 2.6, with a density almost ten times that in the
neighborhood of 4.0. Thus, ignoring the limiting behavior of the joint posterior and
restricting attention to interior modes still produces values for the true scores in this
example which show substantially more regression than would be expected on the basis of the
posterior means of the $T_i$.
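
The numerical example can be checked with a short Python sketch. It assumes that g has the form var_t**(-m/2) * exp(-sum_sq / (2 * (var_t + var_e))), where sum_sq is the sum of squared deviations of the observed scores from their mean, and it further assumes that the reported sample variance uses the divisor m - 1. Neither the function name `log10_g` nor the grid search is taken from the report; they are only one way of exploring the curve.

```python
import math


def log10_g(var_t, m, sum_sq, var_e):
    """log10 of g(var_t) = var_t**(-m/2) * exp(-sum_sq / (2 * (var_t + var_e)))."""
    log_g = -0.5 * m * math.log(var_t) - sum_sq / (2.0 * (var_t + var_e))
    return log_g / math.log(10.0)


m, sample_var, var_e = 100, 5.0, 1.0
# Assumption: sample variance computed with divisor m - 1.
sum_sq = (m - 1) * sample_var

# Search a grid that starts above the region where g blows up near zero.
grid = [0.5 + 0.01 * k for k in range(951)]  # 0.50, 0.51, ..., 10.00
grid_mode = max(grid, key=lambda v: log10_g(v, m, sum_sq, var_e))

# Density at the interior mode relative to the density at the consistent estimate 4.0.
ratio = 10.0 ** (log10_g(grid_mode, m, sum_sq, var_e) - log10_g(4.0, m, sum_sq, var_e))

print("interior mode near", round(grid_mode, 2))
print("density ratio, interior mode versus 4.0:", round(ratio, 1))
print("log10 g at 1e-6:", round(log10_g(1e-6, m, sum_sq, var_e), 1))
```

Under these assumptions the grid search places the interior mode close to the value of 2.6 quoted in the text, with a density ratio a little under ten, while the value of log10 g very near zero is far larger than at the interior mode. Using the divisor m instead of m - 1 shifts both numbers slightly but does not change the picture.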



Figure 1. Joint Posterior Density with 100 Examinees, Sample Variance 5, and Error Variance 1

To study interior modes more generally, one may set the derivative of the
logarithm of $g$ with respect to $\sigma_T^2$ equal to zero and solve for $\sigma_T^2$. The result may be written
as

$$\tilde{\sigma}_T^2 = s_X^2 \, \frac{1 + \sqrt{1 - 4 (\sigma_E^2 / s_X^2)}}{2} - \sigma_E^2 , \tag{12}$$

where $s_X^2$ has been used to denote the sample variance of the observed test scores. Of
course, this expression can only be evaluated when

$$s_X^2 \ge 4 \sigma_E^2 . \tag{13}$$

For smaller values of the sample variance, there is no interior mode for the joint posterior
density and, consequently, no alternative to the complete regression mode.

Assuming Inequality 13 holds, we may compare the result given in Equation 12
with the alternative

$$\hat{\sigma}_T^2 = s_X^2 - \sigma_E^2 . \tag{14}$$

It is clear that Equation 14 will always produce the larger of the two values, resulting in
less regression for the corresponding true score values. In fact, as $s_X^2$ approaches the
lower limit given in Inequality 13, $\hat{\sigma}_T^2$ will approach three times $\tilde{\sigma}_T^2$. For larger values
of $s_X^2$, the difference between the two estimates approaches $\sigma_E^2$. In other words, for
$s_X^2 \gg 4\sigma_E^2$,

$$\tilde{\sigma}_T^2 \approx s_X^2 - 2\sigma_E^2 . \tag{15}$$

What is perhaps most disturbing about these results is that they are unaffected by
the number of observed test scores being analyzed. That is, although the estimate $\hat{\sigma}_T^2$ of
Equation 14 converges in probability to the true value of $\sigma_T^2$ as the number of test scores
increases without limit, the same cannot be said of the interior mode $\tilde{\sigma}_T^2$ of Equation 12.
Consequently, the joint posterior modal estimates of the
true scores, even if an interior mode exists and we restrict our attention to it, do not
approach the 'true' regression estimates given in Equation 4.
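
The comparison between the interior mode and the consistent estimate can be made concrete with a small Python sketch. It reads Equation 12 as s2 * (1 + sqrt(1 - 4 * var_e / s2)) / 2 - var_e and Equation 14 as s2 - var_e, where s2 is the sample variance of the observed scores; both readings are assumptions inferred from the surrounding text, and the function names are hypothetical. Note that the number of examinees does not appear anywhere in these expressions.

```python
import math


def interior_mode(sample_var, var_e):
    """Interior mode of the joint posterior (assumed reading of Equation 12).

    Returns None when the sample variance is below 4 * var_e, in which case
    no interior mode exists (Inequality 13).
    """
    if sample_var < 4.0 * var_e:
        return None
    root = math.sqrt(1.0 - 4.0 * var_e / sample_var)
    return sample_var * (1.0 + root) / 2.0 - var_e


def consistent_estimate(sample_var, var_e):
    """Sample variance minus error variance (assumed reading of Equation 14)."""
    return sample_var - var_e


var_e = 1.0
for sample_var in (4.0, 5.0, 20.0, 1000.0):
    mode = interior_mode(sample_var, var_e)
    estimate = consistent_estimate(sample_var, var_e)
    # Columns: sample variance, interior mode, consistent estimate, difference.
    print(sample_var, mode, estimate, estimate - mode)
```

At the lower limit (sample variance 4 with error variance 1) the consistent estimate is three times the interior mode, and for large sample variances the difference settles near the error variance, in line with Equation 15.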

In conclusion we may say that Empirical Bayes estimates of true scores are 
clearly appropriate and receive asymptotic support from the behavior of the posterior true 
score means. Nonetheless, the asymptotic properties of the posterior mode and, hence, of 
the joint posterior density, demonstrate that there are still gaps in our understanding of the 
inferential foundations for this important problem in mental test theory. 



III. Test Theory Reconceived 
by 

Robert J. Mislevy 2 

Introduction 

Educational test theory is a corpus of concepts, models, and methods for making 
inferences about students' proficiencies. The principles of statistical inference are thus 
brought to bear on practical problems in selection, instruction, and evaluation. Recent 
decades have witnessed considerable progress in models and methods, but the conceptual
foundations have advanced precious little in the past century. The problem: The standard
models of test theory evolved to address problems cast in the psychology of the first half
of the Twentieth Century. They fall short for solving the range of problems cast in the
emerging view of how people think, learn, and solve problems. The challenge: To extend
test theory to a broader family of student models, in directions indicated by recent
developments in cognitive and educational psychology.

2 This paper has benefited from conversations with Henry Braun, Charles Lewis, and Howard Wainer.



The Evolution of Standard Test Theory 

The conceptual foundations of standard test theory are found in the psychology of 
Charles Spearman (e.g., 1904) and L. L. Thurstone (e.g., 1947). A person is character- 
ized by a small number of real-valued variables, "traits," that drive the probabilities of his 
or her observable responses in specified settings. In educational testing, for example, the 
student model is typically a single variable, say, ability, and the observations are responses
to test items, assumed conditionally independent given ability. Gulliksen (1961)
described the central problem of test theory as "the relation between the ability of the
individual and his observed score on the test" (p. 101; emphasis original).

The paradigm of trait psychology suits the mass educational systems that arose in 
the United States at the turn of the century, and dominate practice yet today. Educators 
were confronted with selection or placement decisions for large numbers of students. 
Resources limited the information they could gather about each student, constrained the 
number of options they could offer, and precluded tailoring programs to individual stu- 
dents once a decision was made. This problem context encourages one to build student 
models around abilities that are few in number, broadly construed, stable over time, ap- 
plicable over wide ranges of students, and discernible by data that are easy to gather and 
analyze. 

Pointing to Lord and Novick's (1968) Statistical Theories of Mental Test Scores as
a watershed event, Lewis (1986) stated that "much of the recent progress in test theory
has been made by treating the study of the relationship between responses to a set of test
items and a hypothesized trait (or traits) of an individual as a problem of statistical
inference" (p. 11). Indeed, we note the appearance of sophisticated estimation procedures
(e.g., Bock & Aitkin, 1981), hierarchical modeling techniques (e.g., Muthén & Satorra,
1989), approaches for test theory based on missing-data theory (e.g., Mislevy, 1991), and
theoretical advances into latent-variable modeling (e.g., Holland & Rosenbaum, 1986).
All of these achievements, however, remain largely within the paradigm of trait
psychology.



The Cognitive Revolution 

Recent decades have also witnessed a paradigmatic revolution in the psychology 
of learning and cognition. The emphasis has shifted away from the characterization of 
stable, universally applicable traits, revealed in the same way for all subjects by constrained
responses in standardized observational settings, and toward an understanding of
how individuals organize and update knowledge, and how they bring that knowledge to bear
in meaningful problems. Learners increase their competence not by simply accumulating
new facts and skills, but by reconfiguring their knowledge structures ("schemas"), by 
automating procedures and "chunking" information to reduce memory loads, and by de- 
veloping strategies and models that tell them when and how facts and skills are relevant. 

The cognitive paradigm shapes conceptions of how to increase students' compe- 
tencies, to help them develop knowledge bases that are increasingly coherent, principled, 
useful, and goal-oriented (Glaser, 1991, p. 26). The implications for test theory that 
follow might be best introduced by an analogy from physics. 

On the Theory of Bridge Design 

A hundred years ago, civil engineers designed bridges in accordance with 
Newton's laws and Euclid's geometry, in the prevailing belief that these models were an 
accurate description of the true nature of the universe. The quantum and relativistic
revolutions shattered this paradigm. Nevertheless, today's civil engineers still design bridges
according to the same approach. What's different?

First, even though the same formulas are employed, they are comprehended from 
the perspective of the new physical paradigm. The formulas through which the bridge is 
designed and constructed are no longer thought of as approximations departing from truth 
only by measurement error, but as engineering tools useful for addressing the problem at 
hand. The bridge is neither so small as to require modeling quantum effects, nor so
massive or fast moving as to require relativistic effects.

Secondly, today's civil engineers work with materials that did not exist a hundred 
years ago, with strengths, flexibilities, and durabilities tailored through modern
metallurgy using, in part, concepts from quantum physics. Even though the same
bridge-building theory is employed, the materials and the products are improved in ways
unanticipated in the previous paradigm.

Finally, while civil engineers continue to solve problems that arose under the 
previous paradigms using the still-useful formulas of Newtonian physics, albeit more ef- 
fectively, other scientists and engineers in fields that did not even exist last century are at- 
tacking problems that could not even be conceived of then — problems in superconductiv- 
ity, microchip design, and fusion research, as examples. 

On the Theory of Educational Tests 

I see the same multiple paths of progress for educational test theory, to support 
educational inference and decision-making from the perspective of contemporary
psychology. The role of the statistician is to work with the educational and cognitive
psychologist to develop useful student models that express the key aspects of knowledge and
proficiency, and support defensible and cost-effective statistical inference in practical 
settings. My comments fall into the realms suggested by the preceding analogy. 

First, educational testing for large-scale selection and placement decisions will 
continue to be useful when resources dictate constraints similar to those that originally 
spawned standard test theory. These applications must be re-examined in the conceptual 
framework of the new paradigm — to be re-justified from new premises, revised so that 
they can be, perhaps abandoned if they cannot. That an application falls into the last of 
these categories signals not a failure of test theory to accomplish what it was designed to 
do, but an inadequacy of the conceptual framework from which the decision-making
alternatives were derived.

An example: The Scholastic Aptitude Test (SAT), "a multiple-choice 
examination for the most part, was added to the College Board Program and administered
first on June 23, 1926" to help colleges select among applicants. The goal was 
"measuring verbal aptitude and ...mathematical aptitude" (Angoff & Dyer, 1971, p. 2), 
purportedly to identify those with high enough trait values to succeed. A strong 
predictive relationship was sufficient to justify the test. A wide diversity of students, 
varying in cultural backgrounds, educational experiences, and personal qualities, can 
obtain the same SAT score, however. Does the same score convey the same information 
about each? On the basis of these scores, should the same inference or the same decision 
be made about all? Maybe, but to justify the program today, one must demonstrate a
direct relation between performance and relevant skills, for example by showing
that the difficulty of verbal test items depends on features of items linked to
elements in theories of comprehension (Scheuneman, Gerritz, & Embretson, 1991).

Secondly, testing may still employ an overall proficiency measurement model, but
with different raw materials — i.e., types of observations. Rather than obtaining informa- 
tion with more easily processed multiple-choice items, for example, an assessment may 
require examinees to solve more complex multi-step tasks, to formulate problems rather 
than solve problems presented by the examiner, or to carry out an extended project at 
least partly of the student's own choosing. The rationale, again, is that correlation alone 
is not enough. Students do not increase proficiency by "increasing their trait values," but 
by studying, learning, and practicing particular content and skills; the content and skills 
tests assess are the content and skills teachers teach. Methods of data collection deemed 
inefficient under the trait paradigm gain currency when viewed as more direct indicators
of the desired outcomes of learning (see, e.g., Frederiksen & Collins, 1989). In this
arena, test theory must develop observational models and inferential procedures to
connect trait-based student models with a broader range of observations.

An example: The College Board's Advanced Placement (AP) Studio Art test
differs from standard tests in that students present for evaluation a portfolio of works they
develop during the course of instruction. An outline of requirements is specified, but, in
order to elicit evidence about the process of developing proficiency as an artist, students
are necessarily provided almost unbridled choice in the specific projects they undertake.
Together, experts in art and statisticians have developed a framework for evaluating
portfolios along performance scales; they must continue, using statistical methodology, to
refine systems by which judges can monitor, control, communicate, and improve their
procedures for ratings of complex performances.

Finally, statisticians interested in educational applications must work with psy- 
chologists and educators to develop workable models for applications that are not ad- 
dressed by standard test theory. In this vein are more detailed models of aspects of stu- 
dent knowledge, for the purposes of immediate, short-term educational decisions. The 
important questions become not "How many items did this student answer correctly?" 
but, in Thompson's (1982) words, "What can this person be thinking so that his actions
make sense from his perspective?"

An example of this type is a computerized intelligent tutoring system (ITS; see,
e.g., Lesgold, Lajoie, Bunzo, & Eggan, 1988). In an ITS, students learn concepts and
practice problems while the tutor continuously updates a student model. The status of the 
student model drives short term instructional decisions, for hints, direct instruction, or 
problem selection. The psychology of learning in the domain determines the nature of 
the student model. Test theory, broadly construed, connects observations to the student 
model. Statistical principles serve as the foundation for inference and decision-making. 

Conclusion 

The cognitive revolution in psychology challenges the very premises upon which 
educational testing was founded. Reconceiving test theory, to tackle both old problems 
as viewed from the new paradigm and new problems that did not previously exist, is a 
task that demands the creative efforts, in concert, of theoreticians, educators, and, I would
submit, statisticians. 

IV. Allowing examinee choice in exams 

by 

Howard Wainer 3 

There is a growing movement in education to radically change the structure of 
standardized exams. This movement has grown from a dissatisfaction with the results ob- 
tained from the kind of standardized exams currently in use. These exams are typically 
composed of a substantial number (commonly between 50 and 100) of multiple choice 
items 4. This dissatisfaction spans many areas, but is principally focused on the perceived
molecular nature of multiple choice items. Many of the complainants express a
preference for larger, more "authentic" items 5.

3 The preparation of this report was made possible through support from the Graduate Record Examination
Board. The author is delighted for the opportunity to acknowledge gratefully this help.

4 A "multiple choice item" is a question paired with several possible answers. The most usual task for an
examinee is to choose the best option from among those offered.

5 "Authentic" here is used to mean items that more closely resemble the real world tasks that the test is
supposed to be predicting.

It has long been understood that a good test must contain enough questions to
cover fairly the content domain. In his description of an 1845 survey of the Grammar and
Writing Schools of Boston, Horace Mann argued that

"... it is clear that the larger the number of questions put to a scholar, the
better is the opportunity to test his merits. If but a single question is put,
the best scholar in the school may miss it, though he would succeed in
answering the next twenty without a blunder; or the poorest scholar may
succeed in answering one question, though certain to fail in twenty others.
Each question is a partial test, and the greater the number of questions,
therefore, the nearer does the test approach to completeness. It is very
uncertain which face of a die will turn up at the first throw; but if the dice
are thrown all day, there will be a great equality in the number of faces
turned up."
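
Mann's point about the number of questions can be illustrated with a simple calculation. The sketch below assumes, purely for illustration, that an examinee answers each of n independent items correctly with the same probability p, so that the standard error of the proportion correct is sqrt(p * (1 - p) / n). This binomial model and the function name `proportion_standard_error` are assumptions introduced here, not something stated in the report.

```python
import math


def proportion_standard_error(p, n_items):
    """Standard error of the proportion correct on n_items independent items,
    each answered correctly with probability p (a simplifying assumption)."""
    return math.sqrt(p * (1.0 - p) / n_items)


# A hypothetical scholar who can answer 80% of the questions in the domain.
for n_items in (1, 5, 20, 100):
    print(n_items, round(proportion_standard_error(0.8, n_items), 3))
```

Under this simplified model the uncertainty in the observed proportion shrinks with the square root of the number of questions, which is the quantitative form of Mann's remark that each question is only a partial test.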

Despite the force of Mann's argument, pressure continues to build to make tests 
from units that are larger than a single multiple choice item. Sometimes these units can be
thought of as aggregations of small items, e.g., testlets (Wainer & Kiely, 1987; Wainer & 
Lewis, 1990); sometimes they are just large items (e.g., essays, mathematical proofs, 
etc.). Large items, by definition, take the examinee longer to complete than do short 
items. Therefore, fewer large items can be completed within the given testing time. 

The fact that an examinee cannot complete very many large items within the allot- 
ted testing time places the test builder in something of a quandary. One must either be 
satisfied with fewer items, and possibly not span the domain of material that is to be ex- 
amined as fully as might have been the case with a much larger number of smaller items, 
or expand the testing time sufficiently to allow the content domain to be well represented. 
Often practicality limits testing time, and so compromises on domain coverage must be 
made. A common compromise is to provide several large items and allow the examinee 
to choose among them. The notion is that in this way the examinee is not placed at a dis- 
advantage by an unfortunate choice of domain coverage by the test builder. 

Allowing examinees to choose the items they will answer presents a difficult set 
of problems. Despite the most strenuous efforts to write items of equivalent difficulty,
some are inevitably more difficult than others. If examinees who choose different items
are to be fairly compared with one another, a basis for that comparison must be
established. How might that be done?

This question is akin to the question answered by equating in traditional methods
of test construction in which different forms of a test are prepared 6 and administered at 
random to different segments of the examinee population. 

All methods of equating are aimed at producing the subjunctive score that an ex- 
aminee would have obtained had that examinee answered a different set of items. To ac- 
complish this feat requires that the unobserved item responses are "missing-at-random." 7 
The act of equating means that we believe that the performance that we observe on one 
test form (or item) tells us something about what performance would have been on an- 
other test form (or item). If we know that the procedure by which an item was chosen has 
nothing to do with any specialized knowledge that the student possesses we can believe 
that the missing responses are missing-at-random. However, if the examinee has a hand
in choosing the items this assumption becomes considerably less plausible. There is an 
important difference between examinee-chosen data and the data usually used to equate 
alternate forms — the latter have data missing by the choice of the examiner, not the 
examinee. 

To understand this more concretely consider two different construction rules for a 
spelling test. Suppose we have a corpus of 100,000 words of varying difficulty, and we 
wish to create a 100-item spelling test. From the proportion of the test's items that the
examinee correctly spells we will infer that the examinee can spell a similar proportion of the total
corpus. Two rules for constructing such a test might be:

• Missing-at-random: We select 100 words at random from the corpus and present them
  to the examinee. In this instance we believe that what we observe is a reasonable
  representation of what we did not observe.

• Examinee selected: A word is presented at random to the examinee, who then decides
  whether or not to attempt to spell it. After 100 attempts the proportion spelled
  correctly is the examinee's raw score. The usefulness of this score depends crucially
  on the extent to which we believe that examinees' judgments of whether or not
  they can spell particular words are related to actual ability. If there is no relation
  between spelling ability and a priori expectation, then this method is as good as
  missing-at-random. At the other extreme, we might believe that examinees know
  perfectly well whether or not they can spell a particular word correctly. In this
  instance a raw score of 100% has quite a different meaning. Thus, if an examinee
  spells 90 words correctly all we know with certainty is that the examinee can spell
  no fewer than 90 words and no more than 99,990. A clue that helps us understand
  how to position our estimate between these two extremes is the number of words
  passed over during the course of obtaining the sample of 100. If the examinee has
  the option of omitting a word, but in fact attempts the first 100 words presented,
  our estimate of that examinee's proficiency will not be very different than that
  obtained under 'missing-at-random.' If it takes 50,000 words for the examinee to find
  100 to attempt we will reach quite a different conclusion. If we have the option of
  forcing the examinee to spell some previously rejected words (sampling from the
  unselected population), we can reduce uncertainty due to selection.

6 Usually strenuous efforts are made to make these various forms as identical to one another in content and
difficulty as possible. An equating is considered successful if a fully informed examinee is indifferent as to
which form she will receive.

7 The random assignment of forms to individuals makes the assumption of 'missing-at-random' credible.

This example should make clear that the mechanism by which items are chosen is 
almost as crucial for correct interpretation as the examinee's performance on those items. 
Is there any way around this problem? How can we compare scores on tests in which all, 
or some, of the items are selected by the examinee? 
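
The spelling-test example can be made concrete with a small simulation. The following is a sketch under stated assumptions, not a model proposed in this paper: the function names, the `self_knowledge` parameter, and sampling words with replacement are all illustrative choices.

```python
import random


def random_test(true_proportion, n_items=100, seed=0):
    """Missing-at-random rule: the examinee must attempt every sampled word."""
    rng = random.Random(seed)
    correct = sum(rng.random() < true_proportion for _ in range(n_items))
    return correct / n_items


def examinee_selected_test(true_proportion, self_knowledge, n_items=100,
                           corpus_size=100_000, seed=0):
    """Examinee-selected rule: return (raw_score, words_presented).

    self_knowledge is the (hypothetical) probability that the examinee's
    judgment about a word is right: 0.5 means the judgment is unrelated to
    ability, 1.0 means the examinee knows perfectly what he or she can spell.
    Words are drawn with replacement, a simplification of the corpus.
    """
    rng = random.Random(seed)
    attempted = correct = presented = 0
    while attempted < n_items and presented < corpus_size:
        presented += 1
        can_spell = rng.random() < true_proportion
        judged_right = rng.random() < self_knowledge
        thinks_can_spell = can_spell if judged_right else not can_spell
        if thinks_can_spell:  # the examinee chooses to attempt this word
            attempted += 1
            correct += can_spell
    return correct / max(attempted, 1), presented


# An examinee who can spell 30% of the corpus, under each rule.
print(random_test(0.3))
print(examinee_selected_test(0.3, self_knowledge=0.5))
print(examinee_selected_test(0.3, self_knowledge=1.0))
```

With perfect self-knowledge the raw score is 100% whatever the true proportion, and only the number of words presented carries information about proficiency. With judgments unrelated to ability, the raw score behaves like the missing-at-random score.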

A brief example of the size of the possible effects that need to be adjusted for may 
be of help. Fremer, Jackson & McPeek (1968) report on a chemistry examination that
had two parts. Part I consisted of a set of multiple choice questions that everyone was
required to answer. Part II allowed the examinee to choose between two large questions. 
The choice divided the examinees into two groups: (A) those who chose to answer ques- 
tion 1 on Part II and (B) those who chose to answer question 2. They found that although 
there was essentially no difference between these two groups in their performance on Part 
I there was an enormous difference on Part II. The mean scores are shown in Table 1 
(below). 

Table 1

                 Group A    Group B
Part I             11.7       11.2
Part II
  Question 1        8.2
  Question 2                   2.7

How should we interpret these results? We might believe that whatever is being tested in 
Part II is quite different from what is tested in Part I, and that those who chose Question 2
are not as good at it as those who chose Question 1. A second possibility is that Question 2 is more
difficult than Question 1. If we believe the latter, fairness requires that we somehow adjust
for it. How? An immediate response might be to use the performance on Part I to adjust 
the scores on Part II. This is sensible only if there is a strong relationship between what is 
tested in Part I and what is tested in Part II. To the extent that they are different the ad- 
justment will be illegitimate. Yet, if they are not different, why bother with Part II at all? 
The ironic conclusion seems to be that if choice is justified we have no good way to make 
comparisons among the groups formed by these choices. We can, however, make fair 
comparisons among choice sections when the use of choice was unnecessary. 

Alas, this sort of argument seems to have fallen largely on deaf ears. Choice op- 
tions are being implemented within more and more large-scale testing programs. It falls 
now to us to figure out how to allow choice while at the same time adjusting the scores 
on choice items of potentially very different difficulty to assure the equivalence (and 
hence the fairness) of the different test forms built by examinees. 

V. Some statistical issues facing NAEP 

by 

Eugene G. Johnson 8 

The National Assessment of Educational Progress (NAEP) is an ongoing, con- 
gressionally mandated survey designed to measure educational achievement and changes 
in that achievement over time for U.S. students of specified ages and grades as well as for
subpopulations defined by demographic characteristics and by specific background and 
experiences. Since its inception in 1969, students have been assessed in the subject areas 
of reading, mathematics, science, writing, social studies, civics, U.S. history, geography,
citizenship, literature, music, career development, art, and computer competence. Many 
subject areas are reassessed periodically to measure trends over time. The assessment has 
always included nationally representative samples of students, drawn via complex multi- 
stage probability sample designs. These samples permit the measurement of nationally 
and regionally defined subpopulations of students but do not allow the reliable reporting 
of state level results. For the 1990 and the 1992 assessments, Congress authorized
voluntary state level assessments, in addition to the national assessments. For this purpose, a
distinct probability sample is drawn within each participating state to provide individual
state representative data. An overview of the NAEP design can be found in Johnson
(1992) and Rust and Johnson (1992).

8 A portion of this document is based on work performed for the National Center for Education Statistics,
Office of Educational Research and Improvement.

Statistical issues currently facing NAEP fall into two general categories. The first 
category consists of issues related to the effects of nonresponse on estimates of subpopu- 
lation achievement. The second category consists of issues related to the use of assess- 
ment methodology other than multiple-choice questions. 

Effects of nonresponse 

The NAEP design selects schools and then students within selected schools for 
participation in NAEP. At each of these stages of selection, participation is voluntary. 
Unfortunately, as testing within schools has become more prevalent, the difficulty in ob- 
taining voluntary participation of the selected schools has increased. In addition to ex- 
pending considerable effort in attempting to convert refusing schools, NAEP handles 
school refusals by providing substitutes for nonparticipating schools that could not be 
converted. However, even though the characteristics of the substitute schools are 
matched as closely as possible to those of the initially selected schools in terms of minor- 
ity enrollment, urbanicity, and median household income, substitution does not eliminate 
bias due to the nonparticipation of the initially selected schools. 

In addition to school nonresponse, there is also an issue of student nonresponse. 
NAEP has handled this type of nonresponse by inflating the sampling weights of the re- 
sponding students to maintain totals within nonresponse adjustment classes within each 
primary sampling unit. Nonresponse bias thus exists to the extent that the distributional 
characteristics of the nonrespondents and the respondents differ within each nonresponse 
adjustment class. Evidence exists (Rust & Johnson, 1992; Rogers, Folsom, Kalsbeek, &
Clemmer, 1977) that the vast majority of nonrespondents to NAEP assessments
resemble the respondents in terms of performance and other characteristics. Nevertheless,
there is reason to believe that some proportion of the nonrespondents are less adequately 
handled by the nonresponse adjustment procedures. 
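
The weighting-class adjustment described above can be sketched as follows. This is an illustration under stated assumptions, not NAEP's production procedure: the record layout, the field names, and the sample values are hypothetical, and each class label is taken to identify one nonresponse adjustment class within one primary sampling unit.

```python
from collections import defaultdict


def adjust_weights(students):
    """Inflate respondents' weights so each class keeps its weighted total.

    students: list of dicts with keys 'cls' (adjustment class label),
    'weight' (sampling weight), and 'responded' (bool).
    Returns new records for respondents only, with adjusted weights.
    """
    total = defaultdict(float)       # weight of all sampled students
    resp_total = defaultdict(float)  # weight of responding students
    for s in students:
        total[s["cls"]] += s["weight"]
        if s["responded"]:
            resp_total[s["cls"]] += s["weight"]

    adjusted = []
    for s in students:
        if s["responded"]:
            factor = total[s["cls"]] / resp_total[s["cls"]]
            adjusted.append({**s, "weight": s["weight"] * factor})
    return adjusted


# Made-up sample: one nonrespondent in class "A", none in class "B".
sample = [
    {"cls": "A", "weight": 10.0, "responded": True},
    {"cls": "A", "weight": 10.0, "responded": True},
    {"cls": "A", "weight": 20.0, "responded": False},
    {"cls": "B", "weight": 5.0, "responded": True},
    {"cls": "B", "weight": 5.0, "responded": True},
]
adjusted = adjust_weights(sample)
for record in adjusted:
    print(record["cls"], record["weight"])
```

The adjustment treats respondents in a class as stand-ins for the nonrespondents in that class, which is exactly where bias enters if the two groups differ.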

The adequacy of the nonresponse adjustment procedures is an issue, particularly 
for the state-level assessments where a major goal is the comparison of performance be- 
tween states. Such a comparison is obviously affected by the level and type of nonre- 
sponse, and the stability of nonresponse across states. NAEP is currently considering 
model-based procedures that attempt to quantify the potential dependence of results on 
the magnitude and characteristics of nonresponse. 
Effects of assessment methodology

NAEP has always been at the forefront of assessment methodologies. For the 
1992 assessment, for example, more than 40% of the questions within the subject areas of
reading and mathematics are free-response items (including items requiring extended
responses, such as essays) and all of the items in the 1992 writing assessment require the
writing of an essay. Further non-multiple choice assessment techniques include oral in- 
terviews, examinee choice questions, evaluation of school-based writing, and assessments 
of a student's ability to carry out concrete tasks (see Mullis, 1992). Each of these
relatively nonstandard assessment techniques presents statistical and psychometric issues that
need to be solved. 

For example, many of the so-called authentic tests involve a specific task that the
student is to perform coupled with a series of questions. The task might be to read a long 
passage or to conduct a science experiment. The questions range from multiple choice, to 
short answer, to extended responses requiring one or more paragraphs. The non-multiple 
choice questions are scored by trained judges. Since the questions are all related to the 
same task, a commonly made, and key, assumption that the items are locally independent 
is likely to be violated. (Local independence means that, conditional on a student's ability
level, the responses to any pair of items are statistically independent.)
Since each task could be (perhaps) approached in a variety of ways, and since the mech- 
anism used to solve the problem is of interest in authentic testing, statistical mechanisms 
are needed to identify subgroups of students who approach a task in similar ways. 
Because the responses to the tasks are rated by judges, work needs to be done to establish 
and hopefully account for the effects of variability in the judgment process on the ratings 
provided. 
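
The local independence assumption mentioned above can be stated compactly. The notation is an assumption introduced here for illustration: X_i denotes the response to item i and \theta the student's ability level.

```latex
P(X_i = x_i,\; X_j = x_j \mid \theta)
  \;=\; P(X_i = x_i \mid \theta)\, P(X_j = x_j \mid \theta),
  \qquad i \neq j .
```

When several items share a single passage or experiment, responses may remain correlated even after conditioning on \theta, and the equality fails.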



VI. Conclusion

This paper is different from most that find their way onto these pages, in that it is 
a description of problems rather than the more common structure that includes both a
statement of the problem and at least an initial solution. In this way it is in the spirit of
Hilbert's famous paper on "Mathematical Problems." We hope that the response to our 
statement of these problems is as successful at eliciting solutions from the readers as 
Hilbert's has been. 